A Simple and Efficient MapReduce Algorithm for Data Cube Materialization

Authors

  • Mukund Sundararajan
  • Qiqi Yan
Abstract

Data cube materialization is a classical database operator introduced in Gray et al. (Data Mining and Knowledge Discovery, Vol. 1), which is critical for many analysis tasks. Nandi et al. (Transactions on Knowledge and Data Engineering, Vol. 6) first studied cube materialization for large-scale datasets using the MapReduce framework, and proposed a sophisticated modification of a simple broadcast algorithm to handle a dataset with a 216GB cube size within 25 minutes with 2k machines in 2012. We take a different approach and propose a simple MapReduce algorithm which (1) minimizes the total number of copy-add operations, (2) leverages locality of computation, and (3) balances work evenly across machines. As a result, the algorithm shows excellent performance, and materialized a real dataset with a cube size of 35.0G tuples and 1.75T bytes in 54 minutes, with 0.4k machines in 2014.

I. CUBE MATERIALIZATION

As a concrete example, in the context of search engine advertising, a typical data analysis task can involve a dataset like the one in Table I. Here, for each region the search engine's users come from, for each category of queries the users entered, and for each advertiser who advertised on the result pages of those queries, we have the total count of ad impressions that advertiser showed.

TABLE I: INPUT DATASET

  country | state | city     | query category | advertiser | count
  --------|-------|----------|----------------|------------|------
  US      | CA    | Mtn View | Retail         | Amazon     | 400
  CN      | ZJ    | Hangzhou | Shopping       | Taobao     | 300
  ..      | ..    | ..       | ..             | ..         | ..

In general, in this paper a dataset can contain a number of discrete hierarchical dimensions and additive metrics. Each hierarchical dimension is composed of some number of columns, with higher-level columns appearing to the left. E.g., the region dimension has three columns (country, state, and city), while the advertiser dimension has just one column. For simplicity, we deal with a single metric called count in this paper, with the understanding that our results extend easily to multiple metrics and to algebraic measures [1].

Many analysis tasks are then concerned with the aggregate metrics for subsets of rows that can be defined by specifying concrete values for a subset of the columns and aggregating over the other columns. We call such subsets of rows segments. (In the case of a hierarchical dimension, if a value is set for a lower-level column such as state, a value must be set for all higher-level columns such as country as well.) For example, we could ask for the total ad impression count for the segment defined by country=US, state=*, city=*, query-category=* and advertiser=Amazon, where * means the column is aggregated over all its possible values. The key of a segment is a vector of the form ("US", "CA", *, "Retail", "Amazon"), listing values for all the columns.

Cube materialization, introduced in Gray et al. [1], refers to the task of computing counts for all segments. It is an important problem: by precomputing the counts of all these segments and hence making them instantly available, it enables real-time reporting, efficient online analytical processing, and many data mining tasks [1]. Note that the number of segments can be large, because for a dataset with n dimensions there can be 2^n segments that each input row contributes to. We call the ratio of the number of outputs to the number of inputs the blow-up ratio; it can easily exceed 10 for many datasets. Hence we measure the scale of a cube materialization problem by the size of the cube, i.e., the set of all segments.
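To make the notions of segments, segment keys, and the blow-up ratio concrete, here is a minimal Python sketch of cube materialization over Table I. It is an in-memory illustration only, not the paper's MapReduce algorithm; the schema in DIMENSIONS and the helper names (dimension_rollups, segment_keys) are ours for this example.

```python
from itertools import product
from collections import defaultdict

# Illustrative schema: each hierarchical dimension is a list of columns,
# higher levels first (mirrors Table I).
DIMENSIONS = [
    ["country", "state", "city"],   # region dimension
    ["query_category"],
    ["advertiser"],
]

def dimension_rollups(values):
    """Valid roll-ups of one hierarchical dimension: keep a prefix of the
    levels and replace the rest with '*', so a lower level is never set
    without its ancestors."""
    return [tuple(values[:i]) + ("*",) * (len(values) - i)
            for i in range(len(values), -1, -1)]

def segment_keys(row):
    """All segment keys an input row contributes to: the cross product of the
    per-dimension roll-ups. A dimension with d levels offers d + 1 choices,
    so a row with n flat (single-column) dimensions yields 2^n keys."""
    per_dim = [dimension_rollups([row[c] for c in dim]) for dim in DIMENSIONS]
    return [sum(choice, ()) for choice in product(*per_dim)]

# Tiny in-memory materialization of the cube over the two rows of Table I.
rows = [
    {"country": "US", "state": "CA", "city": "Mtn View",
     "query_category": "Retail", "advertiser": "Amazon", "count": 400},
    {"country": "CN", "state": "ZJ", "city": "Hangzhou",
     "query_category": "Shopping", "advertiser": "Taobao", "count": 300},
]
cube = defaultdict(int)
for row in rows:
    for key in segment_keys(row):
        cube[key] += row["count"]               # one copy-add per (row, segment)

print(len(segment_keys(rows[0])))               # 16 = 4 * 2 * 2 segments per row
print(cube[("US", "*", "*", "*", "Amazon")])    # 400 (the example segment above)
```

In this toy setting the blow-up ratio is 16 output segments per input row; datasets with more dimensions blow up much further, which is what makes materialization at scale hard.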
II. THREE IMPORTANT FACTORS FOR EFFICIENCY

With the recent surge of "big data", we need to do cube materialization for larger and larger datasets, and the focus of this paper is to do this at scale with a parallel computation framework such as MapReduce [2]. Nandi et al. [3] was the first paper to study cube materialization at large scale, for the class of partially-algebraic measures. We focus on additive and algebraic measures, and propose a simpler algorithm with better scaling properties. To aid our discussion, we identify three important factors for efficient algorithms below. Our algorithm was designed around these factors, and we will compare it to Nandi et al.'s with respect to them.

Minimizing copy-add operations / messages: In cube materialization, the basic operation or unit of work is the copy-add. For example, to compute the aggregate for the segment country=US, one option is to copy and add the counts of all input rows with country=US onto an aggregating variable for that segment. We note that all existing algorithms are based on this basic copy-add operation.

Messages: We call such a copy-add operation a message, as it can be seen as each input row sending its aggregate as a message to the segment country=US. A good algorithm should avoid an excessive number of messages.

Locality: A copy-add operation, i.e., a message, can be remote or local, depending, for example, on whether an input row with country=US is located on a different machine from the count for the segment country=US. A remote message is well known to be more expensive than an intra-machine local message, by one or two orders of magnitude, and should be avoided as much as possible.

Remark: Since cube materialization often leads to a blow-up in data size, it is often coupled with aggressive output filtering, with most output rows never written out. For this reason, we ignore the cost of output writing and do not count output writes as remote messages.

Balance: To avoid the running time being dominated by a few straggling machines, a good algorithm needs to partition the data so that the local messages are evenly distributed across machines. (Remote messages are evenly distributed automatically due to random sharding.) The challenge here is to deal with various kinds of data skewness in real datasets. For example, if we shard the data by advertiser id, the sharding can be uneven when a big advertiser contributes a significant fraction of the input rows.
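To illustrate how these three factors interact, below is a minimal Python sketch of the simple broadcast baseline (every row sends one copy-add message per segment) combined with combiner-style local pre-aggregation. This is not the algorithm proposed in the paper; it reuses the rows and segment_keys names from the sketch in Section I and runs everything in one process as a stand-in for multiple machines.

```python
from collections import defaultdict

def map_phase(shard):
    """Broadcast mapper: every input row sends one copy-add message to every
    segment it belongs to. With blow-up ratio b this emits b * |shard|
    messages, all of which would be remote after shuffling by key."""
    for row in shard:
        for key in segment_keys(row):            # helper from the Section I sketch
            yield key, row["count"]

def combine_phase(messages):
    """Combiner-style local pre-aggregation: copy-adds within one machine are
    cheap local messages, and only one partial sum per distinct key leaves
    the machine, cutting remote traffic when a shard is dense in a segment."""
    partial = defaultdict(int)
    for key, count in messages:
        partial[key] += count                    # local copy-add
    return list(partial.items())

def reduce_phase(all_partials):
    """Reducer: remote copy-adds that merge partial sums into final segment
    counts. Random sharding of keys spreads this work evenly, but the local
    map/combine work can still be skewed by the input data itself."""
    cube = defaultdict(int)
    for key, count in all_partials:
        cube[key] += count                       # remote copy-add
    return cube

# Single-process stand-in for two machines, each holding one shard of Table I.
shards = [rows[:1], rows[1:]]
partials = [combine_phase(map_phase(shard)) for shard in shards]
cube = reduce_phase(kv for partial in partials for kv in partial)
print(cube[("*", "*", "*", "*", "*")])           # 700: grand total over all rows
```

Counting messages in such a sketch, local copy-adds in map/combine versus remote ones entering the reducers, and checking how both spread across shards, is one way to compare algorithms along the three factors above.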




Journal:
  • CoRR

Volume: abs/1709.10072

Year of publication: 2017